The salary data in Salaries2017.csv were downloaded from USA Today at https://www.usatoday.com/sports/mlb/salaries/2017/player/all/.
sal2017 = read.csv("Salaries2017.csv", header=TRUE)
head(sal2017)
## Rank Name Team POS Salary Years Total.Value
## 1 1 Clayton Kershaw LAD SP 33000000 7 (2014-20) 215000000
## 2 2 Zack Greinke ARI SP 31876966 6 (2016-21) 206500000
## 3 3 David Price BOS SP 30000000 7 (2016-22) 217000000
## 4 4 Miguel Cabrera DET 1B 28000000 10 (2014-23) 292000000
## 5 4 Justin Verlander DET SP 28000000 7 (2013-19) 180000000
## 6 6 Jason Heyward CHC RF 26055288 8 (2016-23) 184000000
## Avg.Annual Source
## 1 30714286 https://www.usatoday.com/sports/mlb/salaries/2017/player/all/
## 2 34416666 10/1/2017
## 3 31000000
## 4 29200000
## 5 25714285
## 6 23000000
Because the file covers the whole population (2017 MLB players), we can compute numerical and graphical summaries of the population itself. We will focus on the annual salary for the year 2017.
N = nrow(sal2017)
summary(sal2017$Salary)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 535000 545500 1562500 4468069 6000000 33000000
mean(sal2017$Salary)
## [1] 4468069
sd(sal2017$Salary)
## [1] 5948459
hist(sal2017$Salary)
boxplot(sal2017$Salary)
Suppose that we now sample from the population, as if we were unable to get information for all 868 players who were on the initial 2017 rosters. If the sample is drawn at random, its statistics should approximate the corresponding population values (parameters).
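n = 30                            # sample size
smplIDs = sample(1:N, n)          # draw n row indices at random, without replacement
smpl = sal2017[sort(smplIDs), ]   # keep the sampled rows, ordered by row index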
head(smpl)
## Rank Name Team POS Salary Years Total.Value Avg.Annual
## 34 34 Mike Trout LAA CF 20083333 6 (2015-20) 144500000 24083333
## 63 63 Jake Arrieta CHC SP 15637500 1 (2017) 15637500 15637500
## 83 83 J.J. Hardy BAL SS 13636781 3 (2015-17) 40000000 13333333
## 92 90 Josh Reddick HOU RF 13000000 4 (2017-20) 52000000 13000000
## 111 104 Todd Frazier CWS 3B 12000000 1 (2017) 12000000 12000000
## 164 164 Brett Cecil STL RP 7750000 4 (2017-20) 30500000 7625000
## Source
## 34
## 63
## 83
## 92
## 111
## 164
summary(smpl$Salary)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 535000 549875 1090000 3851130 3750000 20083333
mean(smpl$Salary)
## [1] 3851130
sd(smpl$Salary)
## [1] 5400181
hist(smpl$Salary)
boxplot(smpl$Salary)
smpl1 = smpl
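Because `sample()` draws at random, rerunning the code above gives a different sample each time. One common way to make a draw repeatable is to call `set.seed()` immediately before sampling. The following is a minimal sketch under stated assumptions: `pop_salary` is a simulated stand-in for `sal2017$Salary` (it is not the real salary data), and the seed values are arbitrary.

```r
# Hypothetical stand-in for sal2017$Salary (simulated, NOT the real salaries)
set.seed(2017)
pop_salary <- round(535000 + rlnorm(868, meanlog = 14, sdlog = 1.2))
N <- length(pop_salary)
n <- 30

# Fixing the seed right before sample() makes the draw repeatable
set.seed(1)
ids_first <- sample(1:N, n)
set.seed(1)
ids_again <- sample(1:N, n)
same_draw <- identical(ids_first, ids_again)

# Without resetting the seed, the next call produces a fresh random draw
ids_new <- sample(1:N, n)

same_draw                      # TRUE: the same seed reproduces the same indices
mean(pop_salary[ids_first])    # sample mean for the seeded draw
mean(pop_salary[ids_new])      # sample mean for the fresh draw
```

In the analysis itself, the same idea would amount to placing a `set.seed()` call before `smplIDs = sample(1:N,n)`; the output shown in this document was produced without a recorded seed, so it cannot be reproduced exactly.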
If we repeat the process, it is unlikely that we will get the same values for the statistics. This is not surprising since it is unlikely that our new sample will contain the same 30 players that the original sample contained.
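n = 30                            # same sample size as before
smplIDs = sample(1:N, n)          # a new, independent random draw of row indices
smpl = sal2017[sort(smplIDs), ]   # keep the sampled rows, ordered by row index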
head(smpl)
## Rank Name Team POS Salary Years Total.Value Avg.Annual
## 27 27 Chris Davis BAL 1B 21233006 7 (2016-22) 161000000 23000000
## 43 43 Hunter Pence SF RF 18700000 5 (2014-18) 90000000 18000000
## 83 83 J.J. Hardy BAL SS 13636781 3 (2015-17) 40000000 13333333
## 84 84 Bryce Harper WSH RF 13625000 1 (2017) 13625000 13625000
## 108 104 Chris Sale BOS SP 12000000 5 (2013-17) 32500000 6500000
## 167 167 Dee Gordon MIA 2B 7742202 5 (2016-20) 50000000 10000000
## Source
## 27
## 43
## 83
## 84
## 108
## 167
summary(smpl$Salary)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 535000 545250 1650000 4398326 4450000 21233006
mean(smpl$Salary)
## [1] 4398326
sd(smpl$Salary)
## [1] 5738254
hist(smpl$Salary)
boxplot(smpl$Salary)
smpl2 = smpl
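Two independent samples are very unlikely to be identical, but they can still share a few players; in the output above, row 83 (J.J. Hardy) appears in both. The sketch below illustrates this with row indices only. It assumes the population size of 868 stated earlier and uses an arbitrary seed; the variable names are introduced here for illustration.

```r
N <- 868   # population size stated in the text
n <- 30    # sample size

set.seed(7)                 # arbitrary seed, only so the sketch is repeatable
ids1 <- sample(1:N, n)      # first sample of row indices
ids2 <- sample(1:N, n)      # second, independent sample

shared <- intersect(ids1, ids2)   # indices that appear in both samples
n_shared <- length(shared)

# Expected number of shared indices (mean of a hypergeometric distribution)
expected_shared <- n * n / N

# Probability that the second sample contains exactly the same n players
p_identical <- 1 / choose(N, n)

n_shared
expected_shared
p_identical
```

With the objects from this document, `intersect(rownames(smpl1), rownames(smpl2))` should list the row labels common to both samples, since subsetting a data frame keeps the original row names.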
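As we suspected, the mean of the first sample ($3.8511305 \times 10^{6}$) is not equal to the mean of the second sample ($4.398326 \times 10^{6}$). The two values are nevertheless fairly close to each other and to the population mean ($4.4680692 \times 10^{6}$): using the population standard deviation reported above, both lie within one standard error, $5948459/\sqrt{30} \approx 1.09 \times 10^{6}$, of it.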
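Repeating the sampling step many times shows how much the sample mean varies from draw to draw. The following is a rough sketch, not the analysis from this document: `pop_salary` is a simulated, right-skewed stand-in for `sal2017$Salary`, and the number of repetitions (5000) and the seed are arbitrary choices.

```r
# Hypothetical stand-in for sal2017$Salary (simulated, NOT the real salaries)
set.seed(42)
pop_salary <- round(535000 + rlnorm(868, meanlog = 14, sdlog = 1.2))
N <- length(pop_salary)
n <- 30

# Draw 5000 samples of size n (without replacement) and record each sample mean
sample_means <- replicate(5000, mean(sample(pop_salary, n)))

# The sample means center on the population mean
mean(pop_salary)
mean(sample_means)

# Their spread is close to the theoretical standard error for sampling
# without replacement: sigma / sqrt(n) times a finite population correction
sigma <- sqrt(mean((pop_salary - mean(pop_salary))^2))
se_theory <- sigma / sqrt(n) * sqrt((N - n) / (N - 1))
se_theory
sd(sample_means)

# Histogram of the simulated sampling distribution of the mean
hist(sample_means, main = "Sample means (n = 30)", xlab = "Mean salary")
```

Replacing `pop_salary` with `sal2017$Salary` would apply the same steps to the actual population, assuming the data frame has been loaded as shown earlier.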